library(tidyverse)
library(readxl) # library for reading data from excel
library(naniar) # library for handling missing data
library(ggplot2) # library for plotting
library(jsonlite) # library for reading json file
library(magick) # image processing toolkit
Game of Thrones character deaths and killers
Group 5: Li-Yen Huang (lh36263), Kai Ng (yn2979)
Overview
In our project analyzing Game of Thrones character deaths, we aim to gain insight into the power dynamics of the series by studying the relationships between characters, the houses they belong to, and the individuals responsible for the killings. Our goals include identifying patterns in character deaths across the series and in the internal and external conflicts between the houses of Westeros.
Goal
The project focuses on discovering interesting insights about the deaths in Game of Thrones through data analysis. We address the following questions:
Task 1: How do the deaths of prominent characters influence the ratings?
Task 2: Which regions are targeted most often?
We will visualize our findings to present the information in a clear and engaging way for users.
Workflow
We started by downloading raw data from various sources and importing it into RStudio/Jupyter. Since the data came in multiple formats (CSV, JSON, and XLSX), we used the appropriate functions to load the data into the project database. Next, we cleaned, formatted, and validated the data, then merged it into a single, unified data frame. We then queried this data frame to extract the information needed to visualize and identify trends.
Database Design
Episode - Death: The episode and death data sets share two common columns: season and episode. This allows us to map each episode to the on-screen deaths that occurred within it.
Death - Character: Ideally, there should be a one-to-one mapping between the death data set and the character data set. However, we lacked sufficient data to establish clear relationships in all cases. Some characters in the show are unnamed, and the data set uses generic terms like “Stark Soldier” to refer to them, making the mapping ambiguous. As a result, we excluded entries where the mapping between the two data sets was uncertain.
Death - Location/Region - Map Location: We planned to link death locations from the death data set to corresponding entries in the map location data set. However, the names did not have a direct one-to-one relationship. For example, “Castle Black” (death location) is part of “The Wall” (map location), but we lacked sufficient data to make these connections. To address this, we used AI to generate a look-up table. Since the AI-generated data might not be fully reliable, we manually verified a sample of the entries. Based on these checks, we implemented the look-up table to help us join the data sets.
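The relationships described above can be sketched as joins. Below is a minimal pandas illustration using toy rows; the tables and values are stand-ins for the real data sets, not the data itself:

```python
import pandas as pd

# Toy fragments of the cleaned tables (illustrative values only)
deaths = pd.DataFrame({
    "name": ["waymar royce", "will"],
    "season": [1, 1],
    "episode": [1, 1],
    "location": ["castle black", "winterfell"],
})
episodes = pd.DataFrame({"season": [1], "episode": [1], "rating": [9.1]})

# Stand-in for the AI-generated look-up table: death location -> map region
lookup = pd.DataFrame({
    "location": ["castle black", "winterfell"],
    "region": ["the wall", "winterfell"],
})

# Episode - Death: join on the shared (season, episode) key
deaths_rated = deaths.merge(episodes, on=["season", "episode"], how="left")

# Death - Region: resolve each death location through the look-up table
deaths_mapped = deaths_rated.merge(lookup, on="location", how="left")
print(deaths_mapped[["name", "rating", "region"]])
```

The left joins keep every death even when an episode rating or region is missing, which matches how the unified data frame is built.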
Data Sources
We used five data sources, one of which was generated by AI. The remaining data were compiled by three individual fans: David Murphy, Mohammad Reza Ghari, and Jeffrey Lancaster.
Death data
Description: This data set documents every on-screen death in Game of Thrones. It includes information on who killed whom, the method used, the location, and the episode of the death. The author set specific criteria to determine what qualifies as an on-screen death; more details are provided at the link.
Source: The data is collected by David Murphy.
Link: https://data.world/datasaurusrex/game-of-thones-deaths
Challenges:
Killers without houses: We identified that some killers do not belong to any house. Initially, we considered replacing this with ‘N/A.’ However, we recognized that it is reasonable for some killers not to be associated with any house, so we decided to leave these values as “None”.
Repeated names: Some death names are repeated. This is because certain characters in the show do not have specific names and are referred to by general terms, such as “Tribesman” or “Wildling.” Despite this, each entry still represents a single death. We decided to retain these entries in the table.
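To illustrate why the repeated generic names are kept, here is a small sketch with hypothetical death numbers (not rows from the actual file):

```python
import pandas as pd

# Illustrative rows: unnamed characters share a generic name,
# yet each row is a distinct on-screen death (hypothetical death numbers)
deaths = pd.DataFrame({
    "name": ["Tribesman", "Wildling", "Wildling"],
    "death_no": [101, 102, 103],
})

# The names repeat, but the death numbers stay unique, so dropping
# "duplicates" by name would silently discard real deaths
name_counts = deaths["name"].value_counts()
print(name_counts["Wildling"], deaths["death_no"].is_unique)
```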
Episode data
Description: This data set provides detailed episode-level data for Game of Thrones. It includes key information such as season and episode numbers, titles, release dates, IMDb ratings, vote counts, reviews, and key contributors like writers and actors.
Source: The data is collected by Mohammad Reza Ghari. He scraped the data from IMDB & Wikipedia.
Link: https://www.kaggle.com/datasets/rezaghari/game-of-thrones
Challenges:
- Non-standard date format: The data stores release dates in a “12-Apr-2004”-style day-month-year format that tidyverse readers do not parse as dates automatically. We therefore read the column as text and convert it into a proper date object with as.Date() and an explicit format string.
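The same conversion can be sketched in pandas, assuming two-digit-year strings that match the "%d-%b-%y" format used in the import code later in this report (the sample values here are hypothetical):

```python
import pandas as pd

# Hypothetical release-date strings in the day-month-year style used by the file
dates = pd.Series(["17-Apr-11", "24-Apr-11"])

# An explicit format string resolves them into proper datetime values,
# mirroring the as.Date(..., format = "%d-%b-%y") call in the R pipeline
parsed = pd.to_datetime(dates, format="%d-%b-%y")
print(parsed.dt.date.tolist())
```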
Location data
Description: This data set contains geographical coordinates for various regions on the Ice and Fire world map.
Source: The data set is compiled by Jeffrey Lancaster, who sourced information from the Game of Thrones Wiki and manually encoded each location.
URL: https://github.com/jeffreylancaster/game-of-thrones/blob/master/data/opening-locations.json
Challenges:
- Non-rectangular data: The location data comes from a JSON file, which is not rectangular data. It consists of two key-value pairs: “note” and “locations.” Fortunately, the “locations” data itself is in a rectangular format, making it easy to extract and use.
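A minimal sketch of pulling the rectangular “locations” array out of such a JSON file; the inline JSON here is a stand-in for the real file, not its actual contents:

```python
import json
import pandas as pd

# Minimal stand-in for the structure of opening-locations.json
raw = json.loads("""
{"note": "map image at https://example.invalid/map.png",
 "locations": [{"name": "Astapor", "fx": 19.196, "fy": 17.342},
               {"name": "Braavos", "fx": 10.342, "fy": 9.363}]}
""")

# The file as a whole is not rectangular, but the "locations" array is,
# so it lifts directly into a data frame
locations = pd.DataFrame(raw["locations"])
print(locations.shape)
```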
Characters data
Description: This data set provides detailed insights into the Game of Thrones characters, including the total number of episodes they appeared in, along with their first and final appearance years on the show.
Source: The data is collected by Mohammad Reza Ghari. He scraped the data from IMDB & Wikipedia.
URL: https://www.kaggle.com/datasets/rezaghari/game-of-thrones?select=characters_v4.csv
Challenges:
- Repeated characters: The character data includes repeated entries for unnamed characters, such as “Stark Soldier.” Since these generic terms represent distinct individuals in the show, we decided to retain all entries in the data set.
Look-up table for mapping locations and regions
Description: A look-up table mapping death locations to map regions.
Source: Because no ready-made look-up table between death locations and map regions was available, we generated this data using the AI tool Perplexity. We selected Perplexity because it can use web searches to generate answers and provides references for its responses, which helps with accuracy and up-to-date information.
*Note: Since the prompting process relies on the location and death data, it will be explained in detail in the data importing section.
Challenges:
Refine AI Answer: While AI is a powerful tool for accessing data we may lack, it can sometimes produce inaccurate results. To improve accuracy, we initially told Perplexity, “If a location doesn’t match any listed regions, leave it blank.” However, the output only matched locations to regions when their names were nearly identical, leaving many entries blank. This approach did not produce good results. We then revised our prompt, asking the AI to infer the best matches based on its knowledge. To verify these results, we manually reviewed a sample of matches by searching for location names online and referring to sources such as the Wiki of Westeros fandom site. The results seemed to fit well. Although this method isn’t entirely reliable, the answers were good enough for our needs.
First Prompt:
Categorize each of the following Game of Thrones locations into their respective regions and present the output in a table. If a location does not belong to any of the listed regions, leave it blank.
……
Result:

Refined Prompt:
Categorize each of the following Game of Thrones locations into their respective regions and present the output in a table. If a location does not have a direct one-to-one mapping, use your best knowledge to infer its region.
……
Result:
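The manual verification step can be sketched as drawing a reproducible random sample from the look-up table for review; the rows shown here are hypothetical, not the actual AI output:

```python
import pandas as pd

# Hypothetical slice of the AI-generated look-up table (location -> region)
lookup = pd.DataFrame({
    "location": ["castle black", "eastwatch", "harrenhal", "dorne"],
    "region": ["the wall", "the wall", "harrenhal", "dorne"],
})

# Draw a reproducible random sample for manual spot-checking against the wiki
sample = lookup.sample(n=2, random_state=42)
print(sample)
```

Fixing the random seed means the same rows come up on every run, so the spot-check is repeatable.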
Import Data
Environment Setup
First, we imported the necessary libraries for our project.
library(tidyverse)
library(readxl)
library(lubridate)
library(jsonlite)
Connecting to DuckDB
Then, we loaded the DuckDB library to connect to the local DuckDB database.
library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)
# create / connect to the DuckDB database (in-memory by default; pass dbdir = to duckdb() for a file-backed database)
drv <- duckdb()
con <- dbConnect(drv)
Install the reticulate package to enable running Python code in R.
library(reticulate)
py_install("pandas")
py_install("matplotlib")
py_install("requests")
py_install("openpyxl")
import pandas as pd
from io import StringIO
import re
import requests
from PIL import Image, ImageEnhance
from io import BytesIO
import matplotlib.pyplot as plt
import numpy as np
Import Death Data
We loaded data from the ‘Game of Thrones Deaths.xlsx’ file. To avoid the program automatically inferring data types and potentially introducing ambiguity, we explicitly specified the col_types parameter in read_excel. Furthermore, we renamed the columns to maintain consistent and uniform header naming conventions.
death_data_type = c("text", "text", "numeric", "numeric", "text", "text", "text", "text", "numeric")
got_deaths = read_excel( "data/Game of Thrones Deaths.xlsx", col_types = death_data_type ) |>
janitor::clean_names()
got_deaths
Validation Queries
# Check row count
got_deaths |>
summarize(row_count = n())
# Check data types
got_deaths |>
glimpse()
Rows: 2,224
Columns: 9
$ name <chr> "Waymar Royce", "Gared", "Will", "Stag", "Direwolf", "Jo…
$ allegiance <chr> "Night's Watch", "Night's Watch", "Night's Watch", "None…
$ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 4, 5, 5, 5, 5, 5, 5, 5, 5,…
$ location <chr> "Beyond the Wall", "Beyond the Wall", "Winterfell", "Win…
$ killer <chr> "White Walker", "White Walker", "Ned Stark", "Direwolf",…
$ killers_house <chr> "None", "None", "House Stark", "None", "None", "House Ar…
$ method <chr> "Ice sword", "Ice sword", "Sword", "Teeth", "Antler", "P…
$ death_no <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
# Check missing values
got_deaths |>
summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
# Check for duplicates
got_deaths |>
summarize(duplicate_count = sum(duplicated(got_deaths)))
We loaded data from the ‘Game of Thrones Deaths.xlsx’ file using the st_read function. This method directly imported the data into our DuckDB database, creating the raw_deaths table.
--Extension to read data from an Excel file
INSTALL spatial;
LOAD spatial;
CREATE OR REPLACE TEMP TABLE raw_deaths AS
SELECT * FROM st_read('data/Game of Thrones Deaths.xlsx');
To create the got_deaths table, we transformed the original raw_deaths dataset by normalizing key fields and ensuring consistency in data types. We generated normalized columns (normalized_name, normalized_allegiance, and normalized_location) by converting text to lowercase and trimming any extra whitespace, which helps standardize the data for efficient joins and comparisons. We also cast several fields to appropriate data types, such as season and episode to INTEGER, and other descriptive fields like name, allegiance, and method to TEXT.
CREATE OR REPLACE TEMP TABLE got_deaths AS
SELECT
LOWER(TRIM(name)) AS normalized_name,
LOWER(TRIM(allegiance)) AS normalized_allegiance,
CAST(season AS INTEGER) AS season,
CAST(episode AS INTEGER) AS episode,
LOWER(TRIM(location)) AS normalized_location,
CAST(killer AS TEXT) AS killer,
CAST("killers house" AS TEXT) AS killers_house,
CAST(method AS TEXT) AS method,
CAST("death no." AS INTEGER) AS death_no
FROM raw_deaths;
Validation Queries
-- Check row count
SELECT COUNT(*) AS row_count
FROM got_deaths;
| row_count |
|---|
| 2796 |
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_deaths';
| column_name | data_type | is_nullable |
|---|---|---|
| normalized_name | VARCHAR | YES |
| normalized_allegiance | VARCHAR | YES |
| season | INTEGER | YES |
| episode | INTEGER | YES |
| normalized_location | VARCHAR | YES |
| killer | VARCHAR | YES |
| killers_house | VARCHAR | YES |
| method | VARCHAR | YES |
| death_no | INTEGER | YES |
-- Check missing values
SELECT column_name, COUNT(*) - COUNT(column_name) AS missing_count
FROM information_schema.columns
WHERE table_name = 'got_deaths'
GROUP BY column_name;
| column_name | missing_count |
|---|---|
| normalized_name | 0 |
| killers_house | 0 |
| normalized_allegiance | 0 |
| season | 0 |
| episode | 0 |
| killer | 0 |
| normalized_location | 0 |
| death_no | 0 |
| method | 0 |
-- Check for duplicates
SELECT
COUNT(*) - COUNT(DISTINCT (normalized_name, normalized_allegiance, season, episode, normalized_location, killer, killers_house, method, death_no)) AS duplicate_count
FROM got_deaths;
--There are duplicates because a generic character can appear multiple times in a single episode, or a character was played by different actors in an episode
| duplicate_count |
|---|
| 571 |
Once the table was created, we tested it to ensure it was functioning as expected.
SELECT *
FROM got_deaths;
| normalized_name | normalized_allegiance | season | episode | normalized_location | killer | killers_house | method | death_no |
|---|---|---|---|---|---|---|---|---|
| waymar royce | night’s watch | 1 | 1 | beyond the wall | White Walker | None | Ice sword | 1 |
| gared | night’s watch | 1 | 1 | beyond the wall | White Walker | None | Ice sword | 2 |
| will | night’s watch | 1 | 1 | winterfell | Ned Stark | House Stark | Sword | 3 |
| stag | none | 1 | 1 | winterfell | Direwolf | None | Teeth | 4 |
| direwolf | none | 1 | 1 | winterfell | Stag | None | Antler | 5 |
| jon arryn | house arryn | 1 | 1 | king’s landing | Lysa Arryn | House Arryn | Poison | 6 |
| dothraki man | dothraki | 1 | 1 | pentos | Dothraki man | Dothraki | Arakh | 7 |
| catspaw assassin | none | 1 | 2 | winterfell | Summer | House Stark | Teeth | 8 |
| mycah | smallfolk | 1 | 2 | kingsroad | Sandor “the Hound” Clegane | House Lannister | Sword | 9 |
| lady | house stark | 1 | 2 | kingsroad | Ned Stark | House Stark | Knife | 10 |
We loaded data from the ‘Game of Thrones Deaths.xlsx’ file. To avoid Python automatically inferring data types and potentially introducing ambiguity, we explicitly specified the dtype parameter in read_excel. Furthermore, we renamed the columns to maintain consistent and uniform header naming conventions.
death_type_spec = {
'Name': 'string',
'Allegiance': 'string',
'Season': 'int32',
'Episode': 'int32',
'Location': 'string',
'Killer': 'string',
'Killers House': 'string',
'Method': 'string',
'Death No.': 'int32'
}
# Read data from excel file
raw_deaths = pd.read_excel(open('data/Game of Thrones Deaths.xlsx', 'rb'), sheet_name = 'Game of Thrones Deaths collecti', dtype = death_type_spec)
# Clean the column names: convert to lowercase, replace spaces with '_', and remove special characters
got_deaths = (
raw_deaths.rename(columns = str.lower)
.rename(columns = lambda x: x.replace( ' ', '_' ))
.rename(columns = lambda x: re.sub('[^a-z0-9_]', '', x ))
)
got_deaths
name ... death_no
0 Waymar Royce ... 1
1 Gared ... 2
2 Will ... 3
3 Stag ... 4
4 Direwolf ... 5
... ... ... ...
2219 Gregor "the Mountain" Clegane ... 2220
2220 Sandor "the Hound" Clegane ... 2221
2221 Jaime Lannister ... 2222
2222 Cersei Lannister ... 2223
2223 Daenerys Targaryen ... 2224
[2224 rows x 9 columns]
Validation Queries
# Check row count
len(got_deaths)
2224
# Check data types for each column
got_deaths.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2224 entries, 0 to 2223
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 2224 non-null string
1 allegiance 1792 non-null string
2 season 2224 non-null int32
3 episode 2224 non-null int32
4 location 2224 non-null string
5 killer 2214 non-null string
6 killers_house 2194 non-null string
7 method 2224 non-null string
8 death_no 2224 non-null int32
dtypes: int32(3), string(6)
memory usage: 130.4 KB
# Check missing values
got_deaths.isnull().sum()
name 0
allegiance 432
season 0
episode 0
location 0
killer 10
killers_house 30
method 0
death_no 0
dtype: int64
# Check for duplicates
got_deaths.duplicated().sum()
np.int64(0)
Note: The validation results for missing values show slight differences in Python. This is because Python’s isnull() function treats None as a null value.
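A minimal demonstration of this behavior, using a toy series rather than the real data:

```python
import numpy as np
import pandas as pd

# pandas treats both None and NaN as missing, which is why its
# missing-value counts can differ from the R-side validation
s = pd.Series(["House Stark", None, np.nan], dtype="object")
print(s.isnull().sum())
```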
Import Episode Data
The episode data is imported from the ‘GOT_episodes_v4.csv’ file. Following the same process as with the death data, we explicitly specified the data types for tidyverse to prevent inaccurate type guessing. Additionally, we formatted date strings, such as “20-Apr-2014,” into date types.
episode_data_col_spec = cols(
Season = col_double(),
Episode = col_double(),
Title = col_character(),
Release_date = col_character(),
Rating = col_double(),
Votes = col_double(),
Summary = col_character(),
Writer_1 = col_character(),
Writer_2 = col_character(),
Star_1 = col_character(),
Star_2 = col_character(),
Star_3 = col_character(),
Users_reviews = col_double(),
Critics_reviews = col_double(),
US_Viewers = col_double(),
Duration = col_double(),
Director = col_character(),
Budget_estimate = col_double()
)
raw_episodes = read_csv("data/GOT_episodes_v4.csv", col_types = episode_data_col_spec) |>
janitor::clean_names()
got_episodes <- raw_episodes |>
mutate( release_date = as.Date(release_date, format = "%d-%b-%y") )
got_episodes
Validation Queries
# Check row count
raw_episodes |>
summarize(row_count_match = n() == nrow(got_episodes))
# Check data types
got_episodes |>
glimpse()
Rows: 73
Columns: 18
$ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, …
$ episode <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8,…
$ title <chr> "Winter Is Coming", "The Kingsroad", "Lord Snow", "Cri…
$ release_date <date> 2011-04-17, 2011-04-24, 2011-05-01, 2011-05-08, 2011-…
$ rating <dbl> 9.1, 8.8, 8.7, 8.8, 9.1, 9.2, 9.2, 9.0, 9.6, 9.5, 8.8,…
$ votes <dbl> 38639, 29285, 27694, 26284, 27349, 27079, 27556, 25645…
$ summary <chr> "Eddard Stark is torn between his family and an old fr…
$ writer_1 <chr> "David Benioff", "David Benioff", "David Benioff", "Da…
$ writer_2 <chr> "D.B. Weiss", "D.B. Weiss", "D.B. Weiss", "D.B. Weiss"…
$ star_1 <chr> "Sean Bean", "Sean Bean", "Sean Bean", "Sean Bean", "S…
$ star_2 <chr> "Mark Addy", "Mark Addy", "Mark Addy", "Mark Addy", "M…
$ star_3 <chr> "Nikolaj Coster-Waldau", "Nikolaj Coster-Waldau", "Nik…
$ users_reviews <dbl> 61, 27, 21, 22, 24, 23, 21, 20, 30, 34, 22, 19, 15, 19…
$ critics_reviews <dbl> 30, 28, 29, 26, 26, 26, 27, 26, 29, 29, 29, 25, 25, 24…
$ us_viewers <dbl> 2.22, 2.20, 2.44, 2.45, 2.58, 2.44, 2.40, 2.72, 2.66, …
$ duration <dbl> 62, 56, 58, 56, 55, 53, 58, 59, 57, 53, 53, 54, 53, 51…
$ director <chr> "Timothy Van Patten", "Timothy Van Patten", "Brian Kir…
$ budget_estimate <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# Check missing values
got_episodes |>
summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
# Check for duplicates
raw_episodes |>
summarize(duplicate_count = sum(duplicated(raw_episodes)))
The episode data is imported from the ‘GOT_episodes_v4.csv’ file by directly reading the data file into our DuckDB database, creating the raw_episodes table.
CREATE OR REPLACE TEMP TABLE raw_episodes AS
SELECT * FROM 'data/GOT_episodes_v4.csv';
We then created the got_episodes table to transform the original raw_episodes dataset into a more structured format. We cast several columns to their appropriate data types: season and episode to INTEGER, and rating, us_viewers, and budget_estimate to FLOAT. We also normalized text fields, such as title, summary, writers, stars, and director, by converting them to lowercase for consistency. The release_date was processed using STRPTIME to convert it into a proper date format.
CREATE OR REPLACE TEMP TABLE got_episodes AS
SELECT
CAST(season AS INTEGER) AS season,
CAST(episode AS INTEGER) AS episode,
LOWER(title) AS title,
STRPTIME(release_date, '%d-%b-%y') AS release_date,
CAST(rating AS FLOAT) AS rating,
CAST(votes AS INTEGER) AS votes,
LOWER(summary) AS summary,
LOWER(writer_1) AS writer_1,
LOWER(writer_2) AS writer_2,
LOWER(star_1) AS star_1,
LOWER(star_2) AS star_2,
LOWER(star_3) AS star_3,
CAST(users_reviews AS INTEGER) AS users_reviews,
CAST(critics_reviews AS INTEGER) AS critics_reviews,
CAST(us_viewers AS FLOAT) AS us_viewers,
CAST(duration AS FLOAT) AS duration,
LOWER(director) AS director,
CAST(budget_estimate AS FLOAT) AS budget_estimate
FROM raw_episodes;
Validation Queries
-- Check row count
SELECT COUNT(*) AS row_count
FROM got_episodes;
| row_count |
|---|
| 73 |
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_episodes';
| column_name | data_type | is_nullable |
|---|---|---|
| season | INTEGER | YES |
| episode | INTEGER | YES |
| title | VARCHAR | YES |
| release_date | TIMESTAMP | YES |
| rating | FLOAT | YES |
| votes | INTEGER | YES |
| summary | VARCHAR | YES |
| writer_1 | VARCHAR | YES |
| writer_2 | VARCHAR | YES |
| star_1 | VARCHAR | YES |
-- Check missing values
SELECT column_name, COUNT(*) - COUNT(column_name) AS missing_count
FROM information_schema.columns
WHERE table_name = 'got_episodes'
GROUP BY column_name;
| column_name | missing_count |
|---|---|
| summary | 0 |
| star_2 | 0 |
| star_3 | 0 |
| writer_1 | 0 |
| director | 0 |
| title | 0 |
| release_date | 0 |
| votes | 0 |
| duration | 0 |
| season | 0 |
-- Check for duplicates
SELECT
COUNT(*) - COUNT(DISTINCT (season, episode, title, release_date, rating, votes, summary, writer_1, writer_2, star_1, star_2, star_3, users_reviews, critics_reviews, us_viewers, duration, director, budget_estimate)) AS duplicate_count
FROM got_episodes;
| duplicate_count |
|---|
| 0 |
Once the table was created, we tested it to ensure it was functioning as expected.
SELECT *
FROM got_episodes;
| season | episode | title | release_date | rating | votes | summary | writer_1 | writer_2 | star_1 | star_2 | star_3 | users_reviews | critics_reviews | us_viewers | duration | director | budget_estimate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | winter is coming | 2011-04-17 | 9.1 | 38639 | eddard stark is torn between his family and an old friend when asked to serve at the side of king robert baratheon; viserys plans to wed his sister to a nomadic warlord in exchange for an army. | david benioff | d.b. weiss | sean bean | mark addy | nikolaj coster-waldau | 61 | 30 | 2.22 | 62 | timothy van patten | 0 |
| 1 | 2 | the kingsroad | 2011-04-24 | 8.8 | 29285 | while bran recovers from his fall, ned takes only his daughters to king’s landing. jon snow goes with his uncle benjen to the wall. tyrion joins them. | david benioff | d.b. weiss | sean bean | mark addy | nikolaj coster-waldau | 27 | 28 | 2.20 | 56 | timothy van patten | 0 |
| 1 | 3 | lord snow | 2011-05-01 | 8.7 | 27694 | jon begins his training with the night’s watch; ned confronts his past and future at king’s landing; daenerys finds herself at odds with viserys. | david benioff | d.b. weiss | sean bean | mark addy | nikolaj coster-waldau | 21 | 29 | 2.44 | 58 | brian kirk | 0 |
| 1 | 4 | cripples, bastards, and broken things | 2011-05-08 | 8.8 | 26284 | eddard investigates jon arryn’s murder. jon befriends samwell tarly, a coward who has come to join the night’s watch. | david benioff | d.b. weiss | sean bean | mark addy | nikolaj coster-waldau | 22 | 26 | 2.45 | 56 | brian kirk | 0 |
| 1 | 5 | the wolf and the lion | 2011-05-15 | 9.1 | 27349 | catelyn has captured tyrion and plans to bring him to her sister, lysa arryn, at the vale, to be tried for his, supposed, crimes against bran. robert plans to have daenerys killed, but eddard refuses to be a part of it and quits. | david benioff | d.b. weiss | sean bean | mark addy | nikolaj coster-waldau | 24 | 26 | 2.58 | 55 | brian kirk | 0 |
| 1 | 6 | a golden crown | 2011-05-22 | 9.2 | 27079 | while recovering from his battle with jaime, eddard is forced to run the kingdom while robert goes hunting. tyrion demands a trial by combat for his freedom. viserys is losing his patience with drogo. | david benioff | d.b. weiss | sean bean | mark addy | michelle fairley | 23 | 26 | 2.44 | 53 | daniel minahan | 0 |
| 1 | 7 | you win or you die | 2011-05-29 | 9.2 | 27556 | robert has been injured while hunting and is dying. jon and the others finally take their vows to the night’s watch. a man, sent by robert, is captured for trying to poison daenerys. furious, drogo vows to attack the seven kingdoms. | david benioff | d.b. weiss | sean bean | mark addy | nikolaj coster-waldau | 21 | 27 | 2.40 | 58 | daniel minahan | 0 |
| 1 | 8 | the pointy end | 2011-06-05 | 9.0 | 25645 | the lannisters press their advantage over the starks; robb rallies his father’s northern allies and heads south to war; the white walkers attack the wall; tyrion returns to his father with some new friends. | david benioff | d.b. weiss | sean bean | michelle fairley | lena headey | 20 | 26 | 2.72 | 59 | daniel minahan | 0 |
| 1 | 9 | baelor | 2011-06-12 | 9.6 | 36217 | robb goes to war against the lannisters. jon finds himself struggling on deciding if his place is with robb or the night’s watch. drogo has fallen ill from a fresh battle wound. daenerys is desperate to save him. | david benioff | d.b. weiss | sean bean | michelle fairley | nikolaj coster-waldau | 30 | 29 | 2.66 | 57 | alan taylor | 0 |
| 1 | 10 | fire and blood | 2011-06-19 | 9.5 | 31789 | robb vows to get revenge on the lannisters. jon must officially decide if his place is with robb or the night’s watch. daenerys says her final goodbye to drogo. | david benioff | d.b. weiss | sean bean | michelle fairley | nikolaj coster-waldau | 34 | 29 | 3.04 | 53 | alan taylor | 0 |
The episode data is imported from the ‘GOT_episodes_v4.csv’ file. Following the same process as with the death data, we explicitly specified the dtype to prevent inaccurate type guessing. Additionally, we formatted date strings, such as “20-Apr-2014,” into date types.
episode_type_spec = {
'Season': 'int32',
'Episode': 'int32',
'Title': 'string',
'Release_date': 'string',
'Rating': 'float64',
'Votes': 'int32',
'Summary': 'string',
'Writer_1': 'string',
'Writer_2': 'string',
'Star_1': 'string',
'Star_2': 'string',
'Star_3': 'string',
'Users_reviews': 'int32',
'Critics_reviews': 'int32',
'US_Viewers': 'float64',
'Duration': 'float64',
'Director': 'string',
'Budget_estimate': 'float64'
}
# Read the CSV file
raw_episodes = pd.read_csv(
"data/GOT_episodes_v4.csv",
dtype=episode_type_spec
)
got_episodes = (
raw_episodes.rename( columns=str.lower )
.assign( release_date=lambda df: pd.to_datetime( df['release_date'], format='%d-%b-%y' ) )
)
got_episodes
season episode ... director budget_estimate
0 1 1 ... Timothy Van Patten 0.0
1 1 2 ... Timothy Van Patten 0.0
2 1 3 ... Brian Kirk 0.0
3 1 4 ... Brian Kirk 0.0
4 1 5 ... Brian Kirk 0.0
.. ... ... ... ... ...
68 8 2 ... David Nutter 0.0
69 8 3 ... Miguel Sapochnik 0.0
70 8 4 ... David Nutter 0.0
71 8 5 ... Miguel Sapochnik 0.0
72 8 6 ... David Benioff 0.0
[73 rows x 18 columns]
Validation Queries
# Check data types for each column
got_episodes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73 entries, 0 to 72
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 season 73 non-null int32
1 episode 73 non-null int32
2 title 73 non-null string
3 release_date 73 non-null datetime64[ns]
4 rating 73 non-null float64
5 votes 73 non-null int32
6 summary 73 non-null string
7 writer_1 73 non-null string
8 writer_2 73 non-null string
9 star_1 73 non-null string
10 star_2 73 non-null string
11 star_3 73 non-null string
12 users_reviews 73 non-null int32
13 critics_reviews 73 non-null int32
14 us_viewers 73 non-null float64
15 duration 73 non-null float64
16 director 73 non-null string
17 budget_estimate 73 non-null float64
dtypes: datetime64[ns](1), float64(4), int32(5), string(8)
memory usage: 9.0 KB
# Check missing values
got_episodes.isnull().sum()
season 0
episode 0
title 0
release_date 0
rating 0
votes 0
summary 0
writer_1 0
writer_2 0
star_1 0
star_2 0
star_3 0
users_reviews 0
critics_reviews 0
us_viewers 0
duration 0
director 0
budget_estimate 0
dtype: int64
# Check for duplicates
(got_episodes.groupby(['season', 'episode'])
.size()
.reset_index(name='count')
.query('count > 1'))
Empty DataFrame
Columns: [season, episode, count]
Index: []
Import Map Location Data
The map location data is imported from the ‘opening-locations.json’ file, which contains two key-value pairs: note and locations. We processed each individually. From the note key-value pair, we extracted the URL of the map image using str_extract() with a regex pattern. For the locations key-value pair, we extracted the data into a location data frame. We then further examined this data frame to confirm it has a rectangular structure.
raw_map_data <- fromJSON("data/opening-locations.json")
got_map_url <- raw_map_data$note |>
str_extract( "https://[^ ]+" )
got_map_url
[1] "https://vignette.wikia.nocookie.net/iceandfire/images/3/37/Ice_and_Fire_World_Map.png/revision/latest?cb=20130127004523"
# Transform data into a rectangular format
tibble(locations = raw_map_data$locations) |>
unnest_wider(locations, ptype = list(
name = character(),
fx = numeric(),
fy = numeric()
)) -> got_locations
got_locations
Validation Queries
# Ensure `raw_map_data` is a data frame
raw_map_data <- as.data.frame(raw_map_data)
# Check row count
raw_map_data |>
summarize(row_count_match = n() == nrow(got_locations))
# Check data types
got_locations |>
glimpse()
Rows: 22
Columns: 3
$ name <chr> "Astapor", "Braavos", "Dorne", "Dragonstone", "Dreadfort", "Eastw…
$ fx <dbl> 19.196, 10.342, 8.416, 8.311, 7.280, 7.395, 6.218, 6.697, 7.025, …
$ fy <dbl> 17.342, 9.363, 16.529, 11.967, 5.999, 3.822, 11.519, 12.759, 4.91…
# Check missing values
got_locations |>
summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
# Check for duplicates
got_locations |>
summarize(duplicate_count = sum(duplicated(got_locations)))
# Check if the location dataframe is rectangular
# We verify whether the number of each column has the same number of rows
got_locations |>
summarize(is_rectangular = all(across(everything(), ~ length(.) == nrow(got_locations))))
The map location data is imported from the ‘opening-locations.json’ file, which contains two key-value pairs: note and locations. We processed each individually by first loading the JSON data into a table named raw_map_data using the read_json_auto function. The locations field, which is an array of nested JSON objects, was then flattened using the UNNEST function to create individual rows for each object in the array. The ->> operator was used to extract specific values (name, fx, and fy) from the JSON objects as plain text. These extracted fields were then transformed and stored in a new table named locations. The note field was directly extracted into another table named notes.
--Extension to read data from a JSON file
INSTALL json;
LOAD json;
CREATE OR REPLACE TEMP TABLE raw_map_data AS
SELECT *
FROM read_json_auto('data/opening-locations.json');
CREATE OR REPLACE TEMP TABLE locations AS
SELECT
location.value->>'name' AS name, --Extracts the 'name' field as text
CAST(location.value->>'fx' AS DOUBLE) AS fx, --Extracts 'fx' and casts it to DOUBLE
CAST(location.value->>'fy' AS DOUBLE) AS fy --Extracts 'fy' and casts it to DOUBLE
FROM raw_map_data,
UNNEST(locations) AS location(value);
CREATE OR REPLACE TEMP TABLE notes AS
SELECT note
FROM raw_map_data;
Map URL
We created the got_map_url table to extract the map URL from the notes table. Using the REGEXP_EXTRACT function, we identified and extracted the URL pattern present in the note field.
CREATE OR REPLACE TEMP TABLE got_map_url AS
SELECT
REGEXP_EXTRACT(note, 'https://[^ ]+') AS map_url
FROM notes;
Once the table was created, we tested it to ensure it was functioning as expected.
SELECT *
FROM got_map_url;
| map_url |
|---|
| https://vignette.wikia.nocookie.net/iceandfire/images/3/37/Ice_and_Fire_World_Map.png/revision/latest?cb=20130127004523 |
Locations
Next, we created the got_locations table to store information about the map locations. We used LOWER(TRIM(name)) to create a normalized_name column, ensuring consistency by converting the location names to lowercase and removing any extra whitespace. Additionally, we cast the name as TEXT, and the coordinates (fx and fy) as FLOAT to represent loc_x and loc_y.
CREATE OR REPLACE TEMP TABLE got_locations AS
SELECT
LOWER(TRIM(name)) AS normalized_name,
CAST(fx AS FLOAT) AS loc_x,
CAST(fy AS FLOAT) AS loc_y
FROM locations;
Validation Queries
-- Check row count
SELECT COUNT(*) AS row_count
FROM got_locations;
| row_count |
|---|
| 22 |
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_locations';
| column_name | data_type | is_nullable |
|---|---|---|
| normalized_name | VARCHAR | YES |
| loc_x | FLOAT | YES |
| loc_y | FLOAT | YES |
-- Check missing values (count NULLs per column of got_locations)
SELECT 'normalized_name' AS column_name, COUNT(*) - COUNT(normalized_name) AS missing_count FROM got_locations
UNION ALL
SELECT 'loc_x', COUNT(*) - COUNT(loc_x) FROM got_locations
UNION ALL
SELECT 'loc_y', COUNT(*) - COUNT(loc_y) FROM got_locations;
| column_name | missing_count |
|---|---|
| normalized_name | 0 |
| loc_x | 0 |
| loc_y | 0 |
-- Check for duplicates
SELECT
COUNT(*) - COUNT(DISTINCT (normalized_name, loc_x, loc_y)) AS duplicate_count
FROM got_locations;
| duplicate_count |
|---|
| 0 |
Once the table was created, we tested it to ensure it was functioning as expected.
SELECT *
FROM got_locations;
| normalized_name | loc_x | loc_y |
|---|---|---|
| astapor | 19.196 | 17.342 |
| braavos | 10.342 | 9.363 |
| dorne | 8.416 | 16.529 |
| dragonstone | 8.311 | 11.967 |
| dreadfort | 7.280 | 5.999 |
| eastwatch | 7.395 | 3.822 |
| harrenhal | 6.218 | 11.519 |
| king’s landing | 6.697 | 12.759 |
| last hearth | 7.025 | 4.911 |
| meereen | 19.935 | 15.592 |
The map location data is imported from the ‘opening-locations.json’ file, which contains two key-value pairs: note and locations. We processed each individually. From the note key-value pair, we extracted the URL of the map image using regex. For the locations key-value pair, we extracted the data into a location data frame. Since the locations contain a dictionary with multiple fields for each location, we used apply to access each dictionary’s fields individually for each row.
raw_map = pd.read_json('data/opening-locations.json')
# Retrieve map url from note
got_map_url = re.findall( 'https://[^ ]+', raw_map['note'][0] )
got_map_url
['https://vignette.wikia.nocookie.net/iceandfire/images/3/37/Ice_and_Fire_World_Map.png/revision/latest?cb=20130127004523']
got_locations = (
raw_map.assign(
name = lambda x: x['locations'].apply(lambda d: d['name']),
fx = lambda x: x['locations'].apply(lambda d: d['fx']),
fy = lambda x: x['locations'].apply(lambda d: d['fy'])
)
.drop( columns = ['note', 'locations'] )
)
got_locations
 name fx fy
0 Astapor 19.196 17.342
1 Braavos 10.342 9.363
2 Dorne 8.416 16.529
3 Dragonstone 8.311 11.967
4 Dreadfort 7.280 5.999
5 Eastwatch 7.395 3.822
6 Harrenhal 6.218 11.519
7 King's Landing 6.697 12.759
8 Last Hearth 7.025 4.911
9 Meereen 19.935 15.592
10 Moat Cailin 5.947 8.145
11 Oldtown 3.114 15.633
12 Pentos 10.395 12.728
13 Pyke 3.229 10.572
14 Qarth 25.580 18.685
15 Riverrun 5.051 11.061
16 The Eyrie 7.030 10.426
17 The Twins 5.260 9.811
18 The Wall 6.770 3.822
19 Vaes Dothrak 23.726 11.686
20 Winterfell 5.666 6.218
21 Yunkai 19.550 16.102
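The DuckDB workflow normalizes location names with LOWER(TRIM(name)) before joining, while the pandas frame above keeps the original casing. If the same normalization is wanted on the pandas side, a minimal sketch (the `normalized_name` column is our own addition, not part of the original workflow):

```python
import pandas as pd

# Toy frame standing in for got_locations
got_locations = pd.DataFrame({
    'name': ['  Astapor', 'Braavos ', "King's Landing"],
    'fx': [19.196, 10.342, 6.697],
    'fy': [17.342, 9.363, 12.759],
})

# Mirror the SQL LOWER(TRIM(name)) step so joins are case- and whitespace-insensitive
got_locations = got_locations.assign(
    normalized_name=lambda df: df['name'].str.strip().str.lower()
)
print(got_locations['normalized_name'].tolist())
```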
Validation Queries
# Check data types for each column
got_locations.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 22 non-null object
1 fx 22 non-null float64
2 fy 22 non-null float64
dtypes: float64(2), object(1)
memory usage: 656.0+ bytes
# Check missing values
got_locations.isnull().sum()
name 0
fx 0
fy 0
dtype: int64
# Check for duplicates
got_locations.duplicated().sum()
np.int64(0)
# Check if the location dataframe is rectangular
# We verify that each column has the same number of rows
(got_locations.apply(len)
.eq(len(got_locations))
.all())
np.True_
Import Characters Data
The character data is imported from the ‘characters_v4.csv’ file. We followed the same workflow as before, explicitly setting column types to prevent readr from guessing them. Additionally, we used the janitor::clean_names() function to clean and standardize the column names.
character_col_spec = cols(
Character = col_character(),
`Actor/ess` = col_character(),
Episodes_appeared = col_integer(),
First_appearance = col_integer(),
Last_appearance = col_integer()
)
got_characters <- read_csv( "data/characters_v4.csv", col_types = character_col_spec ) |>
janitor::clean_names()
got_characters
Validation Queries
#Check row count
got_characters |>
summarize(row_count = n())
#Check data types
got_characters |>
glimpse()
Rows: 832
Columns: 5
$ character <chr> "Tyrion Lannister", "Cersei Lannister", "Daenerys Ta…
$ actor_ess <chr> "Peter Dinklage", "Lena Headey", "Emilia Clarke", "K…
$ episodes_appeared <int> 67, 62, 62, 62, 59, 59, 55, 52, 48, 47, 46, 42, 42, …
$ first_appearance <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
$ last_appearance <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019…
#Check missing values
got_characters |>
summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
#Check for duplicates
got_characters |>
summarize(duplicate_count = sum(duplicated(got_characters)))
The character data is imported from the ‘characters_v4.csv’ file. We followed the same workflow as before, directly reading the data file into our DuckDB database to create the characters table.
CREATE OR REPLACE TEMP TABLE characters AS
SELECT * FROM 'data/characters_v4.csv';
The got_characters table was created to transform and standardize the characters data for easier analysis. We created a normalized_character column by converting the character names to lowercase and trimming any extra whitespace, ensuring consistency when joining with other datasets. The actor_ess field was also converted to lowercase for uniformity. Additionally, we cast numerical fields, such as episodes_appeared, first_appearance, and last_appearance, as INTEGER to ensure the correct data type.
CREATE OR REPLACE TEMP TABLE got_characters AS
SELECT
LOWER(TRIM(character)) AS normalized_character,
LOWER("actor/ess") AS actor_ess,
CAST(episodes_appeared AS INTEGER) AS episodes_appeared,
CAST(first_appearance AS INTEGER) AS first_appearance,
CAST(last_appearance AS INTEGER) AS last_appearance
FROM characters;
Validation Queries
-- Check row count
SELECT COUNT(*) AS row_count
FROM got_characters;
| row_count |
|---|
| 832 |
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_characters';
| column_name | data_type | is_nullable |
|---|---|---|
| normalized_character | VARCHAR | YES |
| actor_ess | VARCHAR | YES |
| episodes_appeared | INTEGER | YES |
| first_appearance | INTEGER | YES |
| last_appearance | INTEGER | YES |
-- Check missing values (count NULLs per column of got_characters)
SELECT 'normalized_character' AS column_name, COUNT(*) - COUNT(normalized_character) AS missing_count FROM got_characters
UNION ALL
SELECT 'actor_ess', COUNT(*) - COUNT(actor_ess) FROM got_characters
UNION ALL
SELECT 'episodes_appeared', COUNT(*) - COUNT(episodes_appeared) FROM got_characters
UNION ALL
SELECT 'first_appearance', COUNT(*) - COUNT(first_appearance) FROM got_characters
UNION ALL
SELECT 'last_appearance', COUNT(*) - COUNT(last_appearance) FROM got_characters;
| column_name | missing_count |
|---|---|
| normalized_character | 13 |
| actor_ess | 0 |
| episodes_appeared | 0 |
| first_appearance | 0 |
| last_appearance | 0 |
-- Check for duplicates
SELECT
COUNT(*) - COUNT(DISTINCT (normalized_character, actor_ess, episodes_appeared, first_appearance, last_appearance)) AS duplicate_count
FROM got_characters;
| duplicate_count |
|---|
| 0 |
Once the table was created, we tested it to ensure it was functioning as expected.
SELECT *
FROM got_characters;
| normalized_character | actor_ess | episodes_appeared | first_appearance | last_appearance |
|---|---|---|---|---|
| tyrion lannister | peter dinklage | 67 | 2011 | 2019 |
| cersei lannister | lena headey | 62 | 2011 | 2019 |
| daenerys targaryen | emilia clarke | 62 | 2011 | 2019 |
| jon snow | kit harington | 62 | 2011 | 2019 |
| sansa stark | sophie turner | 59 | 2011 | 2019 |
| arya stark | maisie williams | 59 | 2011 | 2019 |
| jaime lannister | nikolaj coster-waldau | 55 | 2011 | 2019 |
| jorah mormont | iain glen | 52 | 2011 | 2019 |
| samwell tarly | john bradley | 48 | 2011 | 2019 |
| theon greyjoy | alfie allen | 47 | 2011 | 2019 |
The character data is imported from the ‘characters_v4.csv’ file. We followed the same workflow as before, explicitly setting column types to prevent pandas from guessing them. Additionally, we renamed the columns to standardize them.
character_type_spec = {
'Character': 'string',
'Actor/ess': 'string',
'Episodes_appeared': 'int32',
'First_appearance': 'int32',
'Last_appearance': 'int32',
}
# Read the CSV file
raw_characters = pd.read_csv(
"data/characters_v4.csv",
dtype=character_type_spec
)
got_characters = (
raw_characters.rename(columns = str.lower)
.rename(columns = lambda x: x.replace( ' ', '_' ))
)Validation Queries
# Check data types
got_characters.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 832 entries, 0 to 831
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 character 819 non-null string
1 actor/ess 832 non-null string
2 episodes_appeared 832 non-null int32
3 first_appearance 832 non-null int32
4 last_appearance 832 non-null int32
dtypes: int32(3), string(2)
memory usage: 22.9 KB
# Check missing values
got_characters.isnull().sum()
character 13
actor/ess 0
episodes_appeared 0
first_appearance 0
last_appearance 0
dtype: int64
# Check for duplicates
(got_characters.groupby('character')
.size()
.reset_index(name='count')
.query('count > 1'))
 character count
24 Baby Sam 4
30 Baratheon Soldier 3
35 Bathhouse Prostitute 2
40 Beric Dondarrion 2
55 Bolton Soldier 3
.. ... ...
640 Wight 6
644 Wildling 9
647 Wildling Archer 2
651 Willa 2
657 Winter Town Woman 2
[66 rows x 2 columns]
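Note that `duplicated().sum()` flags only rows identical in every column, whereas the `groupby` check above flags repeated names even when other columns differ, so the two counts can disagree. A small illustration with hypothetical rows:

```python
import pandas as pd

df = pd.DataFrame({
    'character': ['Wildling', 'Wildling', 'Arya Stark'],
    'actor/ess': ['Actor A', 'Actor B', 'Maisie Williams'],
})

full_row_dups = df.duplicated().sum()           # rows identical in every column
name_dups = df['character'].duplicated().sum()  # rows sharing only the name

print(full_row_dups, name_dups)  # 0 full-row duplicates, 1 repeated name
```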
# There are duplicates due to a general character appearing multiple times or being played by different actors in an episode.
Import a look-up table to map locations to their corresponding regions
We required a look-up table mapping death locations to map regions. Since no such data set was available, we used Perplexity to generate the table. Our process is described below:
First, we extracted the individual lists of death locations and map regions.
# Retrieve death location list
distinct( got_deaths, location )
# Retrieve map region list
got_locations |>
select(name)
--list of death locations
SELECT DISTINCT location
FROM raw_deaths;
| Location |
|---|
| Winterfell |
| Harrenhal |
| Qarth |
| Valyria |
| Highgarden |
| Beyond the Wall |
| Eastern Road |
| The Eyrie |
| Storm’s End |
| Dorne |
--list of map regions
SELECT name
FROM locations;
| name |
|---|
| Astapor |
| Braavos |
| Dorne |
| Dragonstone |
| Dreadfort |
| Eastwatch |
| Harrenhal |
| King’s Landing |
| Last Hearth |
| Meereen |
# Extract location names from the death data
location_list = (
got_deaths['location']
.drop_duplicates(keep = 'first')
.reset_index(drop = True)
)
location_list
0 Beyond the Wall
1 Winterfell
2 King's Landing
3 Pentos
4 Kingsroad
5 Eastern Road
6 The Eyrie
7 Vaes Dothrak
8 Castle Black
9 Lhazar
10 Red Waste
11 Unknown
12 Dragonstone
13 Riverlands
14 Oxcross
15 Harrenhal
16 Storm's End
17 Qarth
18 Robb Stark's camp
19 Astapor
20 Riverrun
21 Dreadfort
22 Hollow Hill
23 The Wall
24 Yunkai
25 The Gift
26 The Twins
27 Meereen
28 Mole's Town
29 Moat Cailin
30 The Vale
31 Braavos
32 Dorne
33 Valyria
34 Hardhome
35 Iron Islands
36 The Narrow Sea
37 Casterly Rock
38 Highgarden
39 Roseroad
40 Last Hearth
Name: location, dtype: string
# Extract region names from the map location data
region_list = got_locations.filter(items = ['name'])
region_list
 name
0 Astapor
1 Braavos
2 Dorne
3 Dragonstone
4 Dreadfort
5 Eastwatch
6 Harrenhal
7 King's Landing
8 Last Hearth
9 Meereen
10 Moat Cailin
11 Oldtown
12 Pentos
13 Pyke
14 Qarth
15 Riverrun
16 The Eyrie
17 The Twins
18 The Wall
19 Vaes Dothrak
20 Winterfell
21 Yunkai
Second, we used Perplexity to generate the look-up table. In our prompt, we included the complete lists of locations and regions and specified the desired output format to ensure clarity and ease of use.
Categorize each of the following Game of Thrones locations into their respective regions and present the output in a table. If a location does not have a direct one-to-one mapping, use your best knowledge to infer its region.
#output format
| location | region |
#location
Beyond the Wall
Winterfell
King's Landing
Pentos
Kingsroad
Eastern Road
The Eyrie
Vaes Dothrak
Castle Black
Lhazar
Red Waste
Unknown
Dragonstone
Riverlands
Oxcross
Harrenhal
Storm's End
Qarth
Robb Stark's camp
Astapor
Riverrun
Dreadfort
Hollow Hill
The Wall
Yunkai
The Gift
The Twins
Meereen
Mole's Town
Moat Cailin
The Vale
Braavos
Dorne
Valyria
Hardhome
Iron Islands
The Narrow Sea
Casterly Rock
Highgarden
Roseroad
Last Hearth
#region
Astapor
Braavos
Dorne
Dragonstone
Dreadfort
Eastwatch
Harrenhal
King's Landing
Last Hearth
Meereen
Moat Cailin
Oldtown
Pentos
Pyke
Qarth
Riverrun
The Eyrie
The Twins
The Wall
Vaes Dothrak
Winterfell
Yunkai
Once the table was generated, we copied it into Excel and saved it as a CSV file, ‘location_region.csv.’ We then imported the CSV file into our project, following the same workflow as the other data sets.
region_location_col_spec = cols(
Region = col_character(),
Location = col_character()
)
got_region_location = read_csv( "data/location_region.csv", col_types = region_location_col_spec ) |>
janitor::clean_names()
got_region_location
Validation Queries
#Check row count
got_region_location |>
summarize(row_count = n())
#Check data types
got_region_location |>
glimpse()
Rows: 40
Columns: 2
$ location <chr> "Beyond the Wall", "Winterfell", "King's Landing", "Pentos", …
$ region <chr> "The Wall", "Winterfell", "King's Landing", "Pentos", "The No…
#Check missing values
got_region_location |>
summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
#Check for duplicates
got_region_location |>
summarize(duplicate_count = sum(duplicated(got_region_location)))
CREATE OR REPLACE TEMP TABLE raw_region_location AS
SELECT * FROM 'data/location_region.csv';
We then created the got_region_location table to standardize the mapping of regions and locations. In this step, we used LOWER(TRIM()) to create normalized_region and normalized_location columns, ensuring the region and location names were consistent by converting them to lowercase and removing any extra whitespace.
CREATE OR REPLACE TEMP TABLE got_region_location AS
SELECT
LOWER(TRIM("Region")) AS normalized_region,
LOWER(TRIM("Location")) AS normalized_location
FROM raw_region_location;
Validation Queries
-- Check row count
SELECT COUNT(*) AS row_count
FROM got_region_location;
| row_count |
|---|
| 40 |
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_region_location';
| column_name | data_type | is_nullable |
|---|---|---|
| normalized_region | VARCHAR | YES |
| normalized_location | VARCHAR | YES |
-- Check missing values (count NULLs per column of got_region_location)
SELECT 'normalized_region' AS column_name, COUNT(*) - COUNT(normalized_region) AS missing_count FROM got_region_location
UNION ALL
SELECT 'normalized_location', COUNT(*) - COUNT(normalized_location) FROM got_region_location;
| column_name | missing_count |
|---|---|
| normalized_region | 0 |
| normalized_location | 0 |
-- Check for duplicates
SELECT
COUNT(*) - COUNT(DISTINCT (normalized_region, normalized_location)) AS duplicate_count
FROM got_region_location;
| duplicate_count |
|---|
| 0 |
Once the table was created, we tested it to ensure it was functioning as expected.
SELECT *
FROM got_region_location;
| normalized_region | normalized_location |
|---|---|
| the wall | beyond the wall |
| winterfell | winterfell |
| king’s landing | king’s landing |
| pentos | pentos |
| the north | kingsroad |
| the vale | eastern road |
| the eyrie | the eyrie |
| vaes dothrak | vaes dothrak |
| the wall | castle black |
| essos | lhazar |
location_region_type_spec = {
'Location' : 'string',
'Region' : 'string'
}
# Read the CSV file
raw_location_region = pd.read_csv(
"data/location_region.csv",
dtype=location_region_type_spec
)
got_location_region = (
raw_location_region.rename( columns=str.lower )
)
got_location_region
 location region
0 Beyond the Wall The Wall
1 Winterfell Winterfell
2 King's Landing King's Landing
3 Pentos Pentos
4 Kingsroad The North
5 Eastern Road The Vale
6 The Eyrie The Eyrie
7 Vaes Dothrak Vaes Dothrak
8 Castle Black The Wall
9 Lhazar Essos
10 Red Waste Essos
11 Dragonstone Dragonstone
12 Riverlands Riverlands
13 Oxcross Westerlands
14 Harrenhal Harrenhal
15 Storm's End Stormlands
16 Qarth Qarth
17 Robb Stark's camp The North
18 Astapor Astapor
19 Riverrun Riverrun
20 Dreadfort Dreadfort
21 Hollow Hill Riverlands
22 The Wall The Wall
23 Yunkai Yunkai
24 The Gift The North
25 The Twins The Twins
26 Meereen Meereen
27 Mole's Town The North
28 Moat Cailin Moat Cailin
29 The Vale The Vale
30 Braavos Braavos
31 Dorne Dorne
32 Valyria Essos
33 Hardhome Beyond the Wall
34 Iron Islands Pyke
35 The Narrow Sea Essos
36 Casterly Rock Westerlands
37 Highgarden The Reach
38 Roseroad The Reach
39 Last Hearth Last Hearth
Validation Queries
# Check data types
got_location_region.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 location 40 non-null string
1 region 40 non-null string
dtypes: string(2)
memory usage: 768.0 bytes
# Check missing values
got_location_region.isnull().sum()
location 0
region 0
dtype: int64
# Check for duplicates
got_location_region.duplicated().sum()
np.int64(0)
# Check for duplicate locations
(got_location_region.groupby('location')
.size()
.reset_index(name='count')
.query('count > 1'))
Empty DataFrame
Columns: [location, count]
Index: []
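One check not run above is referential integrity between the two data sets: every death location should appear in the look-up table, or the later left join will leave its region missing (in the lists above, for instance, ‘Unknown’ appears among the death locations but not in the generated table). A sketch of such a check on toy frames standing in for the project data:

```python
import pandas as pd

# Toy stand-ins for the death locations and the generated look-up table
got_deaths = pd.DataFrame({'location': ['Winterfell', 'Qarth', 'Unknown']})
got_location_region = pd.DataFrame({
    'location': ['Winterfell', 'Qarth'],
    'region': ['Winterfell', 'Qarth'],
})

# Death locations with no entry in the look-up table; these would get
# missing regions in the later left join
unmatched = set(got_deaths['location']) - set(got_location_region['location'])
print(unmatched)
```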
Combine Data into One Data Frame
We addressed the issue of the many-to-many relationship between the character and death data sets. Some characters were listed using general terms, such as “Wildling” or “Stark Soldier”, and we didn’t have enough information to match them accurately. To handle this, we decided to remove these general terms. Since they could appear multiple times in the data sets, we removed the duplicate entries from both the character and death data sets.
# Exclude the characters having repeated names
got_characters <- got_characters |>
distinct(character, .keep_all = TRUE)
got_characters
# Exclude the deaths having repeated names
got_deaths <- got_deaths |>
distinct(name, .keep_all = TRUE)
got_deaths
Remove duplicated character names
--Exclude the characters having repeated names
CREATE OR REPLACE TEMP TABLE got_characters AS
SELECT DISTINCT ON (normalized_character) *
FROM got_characters;
SELECT *
FROM got_characters;
| normalized_character | actor_ess | episodes_appeared | first_appearance | last_appearance |
|---|---|---|---|---|
| catelyn stark | michelle fairley | 26 | 2011 | 2016 |
| roose bolton | michael mcelhatton | 20 | 2012 | 2016 |
| alliser thorne | owen teale | 19 | 2011 | 2016 |
| gregor ‘the mountain’ clegane | hafþór júlíus björnsson | 17 | 2014 | 2019 |
| olly | brenock o’connor | 17 | 2014 | 2016 |
| wun wun | ian whyte | 16 | 2011 | 2019 |
| yohn royce | rupert vansittart | 13 | 2014 | 2019 |
| high sparrow | jonathan pryce | 12 | 2015 | 2016 |
| nymeria sand | jessica henwick | 8 | 2015 | 2017 |
| marei | josephine gillan | 8 | 2012 | 2019 |
Remove duplicated victim names
--Exclude the deaths having repeated names
CREATE OR REPLACE TEMP TABLE got_deaths AS
SELECT DISTINCT ON (normalized_name) *
FROM got_deaths;
SELECT *
FROM got_deaths;
| normalized_name | normalized_allegiance | season | episode | normalized_location | killer | killers_house | method | death_no |
|---|---|---|---|---|---|---|---|---|
| robert baratheon | house baratheon of king’s landing | 1 | 7 | king’s landing | Boar | None | Tusk | 34 |
| qotho | dothraki | 1 | 9 | red waste | Jorah Mormont | House Targaryen | Sword | 51 |
| irri | house targaryen | 2 | 6 | qarth | Unknown | Unknown | Unknown | 93 |
| hoster tully | house tully | 3 | 2 | riverrun | None | None | Illness | 190 |
| rickard karstark | house stark | 3 | 5 | riverrun | Robb Stark | House Stark | Sword | 209 |
| horse breeder | night’s watch | 3 | 9 | the gift | Ygritte | Free Folk | Arrow | 223 |
| yunkai soldier | wise masters | 3 | 9 | yunkai | Jorah Mormont | House Targaryen | Sword | 228 |
| unborn stark child | house stark | 3 | 9 | the twins | Lothar Frey | House Frey | Knife | 239 |
| robb stark | house stark | 3 | 9 | the twins | Roose Bolton | House Bolton | Knife | 265 |
| joffrey baratheon | house lannister | 4 | 2 | king’s landing | Olenna Tyrell | House Tyrell | Poison | 285 |
Validation Queries
We ran another duplicate check on the got_deaths table to ensure there are no more duplicate values.
-- Check for duplicates
SELECT
COUNT(*) - COUNT(DISTINCT (normalized_name, normalized_allegiance, season, episode, normalized_location, killer, killers_house, method, death_no)) AS duplicate_count
FROM got_deaths;
--There are no more duplicates after general characters are removed from the dataset
| duplicate_count |
|---|
| 0 |
# Exclude rows with repeated character names
got_characters = got_characters.drop_duplicates(subset = 'character', keep = 'first')
# Exclude rows with repeated names
got_deaths = got_deaths.drop_duplicates(subset = 'name', keep = 'first')
We moved on to the main process of data combination.
First, we joined the death data with the episode data based on the season and episode. Next, we used the victim’s name to join the character data, which provided the number of episodes each victim appeared in. We then dropped the irrelevant columns to ensure the data was focused on our goals. During this stage, we discovered that some victims were missing episode appearance data. Assuming each character appeared at least once, we filled in the missing values with 1. Finally, we used the look-up table to join the death location with the corresponding map region. Through this process, we created a unified data frame for further analysis and visualization.
deaths_detail_info <- got_deaths |>
left_join( got_episodes, by = c("season", "episode") ) |>
left_join( got_characters, by = join_by( name == character ) ) |>
select( season, episode, title, rating, name, killer, location, episodes_appeared ) |>
# Assume characters without data for episodes_appeared have appeared at least once in a scene
mutate( episodes_appeared = replace_na( episodes_appeared, 1 )) |>
left_join( got_region_location, by = "location" )|>
left_join( got_locations, by = join_by( region == name ) )
deaths_detail_info
Instead of creating a new identifier, we joined the datasets directly on the season and episode fields shared by the death and episode data, using them to bring in the episode data (got_episodes).
Next, we joined the character data (got_characters) using the normalized_name of each victim, which provided the number of episodes each victim appeared in (episodes_appeared). We then dropped the irrelevant columns to ensure the data was focused on our goals. During this stage, we discovered that some victims were missing episode appearance data. Assuming each character appeared at least once, we filled in the missing values with 1 using COALESCE.
Following this, we used the death normalized_location to join the got_region_location table, giving us the normalized_region for each death. Finally, we joined the map coordinates (got_locations) based on the region, which allowed us to add geographical information (loc_x and loc_y) for each death event. Through this process, we created a unified data frame for further analysis and visualization.
CREATE OR REPLACE TEMP TABLE deaths_detail_info AS
SELECT
d.season,
d.episode,
r.title,
r.rating,
d.normalized_name,
d.killer,
d.normalized_location,
COALESCE(c.episodes_appeared, 1) AS episodes_appeared,
reg.normalized_region,
loc.loc_x,
loc.loc_y
FROM got_deaths d
LEFT JOIN got_episodes r ON d.season = r.season AND d.episode = r.episode
LEFT JOIN got_characters c ON d.normalized_name = c.normalized_character
LEFT JOIN got_region_location reg ON d.normalized_location = reg.normalized_location
LEFT JOIN got_locations loc ON reg.normalized_region = loc.normalized_name
ORDER BY
d.season,
d.episode,
episodes_appeared DESC;
Once the table was created, we tested it to ensure it was functioning as expected.
SELECT *
FROM deaths_detail_info;
| season | episode | title | rating | normalized_name | killer | normalized_location | episodes_appeared | normalized_region | loc_x | loc_y |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | winter is coming | 9.1 | gared | White Walker | beyond the wall | 1 | the wall | 6.770 | 3.822 |
| 1 | 1 | winter is coming | 9.1 | will | Ned Stark | winterfell | 1 | winterfell | 5.666 | 6.218 |
| 1 | 1 | winter is coming | 9.1 | waymar royce | White Walker | beyond the wall | 1 | the wall | 6.770 | 3.822 |
| 1 | 1 | winter is coming | 9.1 | jon arryn | Lysa Arryn | king’s landing | 1 | king’s landing | 6.697 | 12.759 |
| 1 | 1 | winter is coming | 9.1 | dothraki man | Dothraki man | pentos | 1 | pentos | 10.395 | 12.728 |
| 1 | 1 | winter is coming | 9.1 | stag | Direwolf | winterfell | 1 | winterfell | 5.666 | 6.218 |
| 1 | 1 | winter is coming | 9.1 | direwolf | Stag | winterfell | 1 | winterfell | 5.666 | 6.218 |
| 1 | 2 | the kingsroad | 8.8 | catspaw assassin | Summer | winterfell | 1 | winterfell | 5.666 | 6.218 |
| 1 | 2 | the kingsroad | 8.8 | mycah | Sandor “the Hound” Clegane | kingsroad | 1 | the north | NA | NA |
| 1 | 2 | the kingsroad | 8.8 | lady | Ned Stark | kingsroad | 1 | the north | NA | NA |
First, we joined the death data with the episode data based on the season and episode. Next, we used the victim’s name to join the character data, which provided the number of episodes each victim appeared in. We then dropped the irrelevant columns to ensure the data was focused on our goals. During this stage, we discovered that some victims were missing episode appearance data. Assuming each character appeared at least once, we filled in the missing values with 1. Finally, we used the look-up table to join the death location with the corresponding map region. Through this process, we created a unified data frame for further analysis and visualization.
deaths_detail_info = (got_deaths
# Merge episode rating data on the shared season and episode columns
.merge(got_episodes,
on = ['season', 'episode'],
how='left')
# Merge character data based on the character name
.merge(got_characters,
left_on = 'name',
right_on = 'character',
how='left')
# Keep only the relevant columns
.filter(items = ['season', 'episode', 'title', 'rating', 'name', 'killer', 'location', 'episodes_appeared'])
# Fill missing episode appearance values with 1 (each character appeared at least once)
.assign(episodes_appeared = lambda df: df['episodes_appeared'].fillna(1) )
# Merge location-region look-up table
.merge(got_location_region,
on = 'location',
how='left')
# Merge map location data based on region
.merge(got_locations,
left_on = 'region',
right_on = 'name',
how='left')
.rename(columns = {
'name_x': 'name'
})
.drop(columns = ['name_y'])
)
deaths_detail_info
 season episode title ... region fx fy
0 1 1 Winter Is Coming ... The Wall 6.770 3.822
1 1 1 Winter Is Coming ... The Wall 6.770 3.822
2 1 1 Winter Is Coming ... Winterfell 5.666 6.218
3 1 1 Winter Is Coming ... Winterfell 5.666 6.218
4 1 1 Winter Is Coming ... Winterfell 5.666 6.218
.. ... ... ... ... ... ... ...
248 8 5 The Bells ... King's Landing 6.697 12.759
249 8 5 The Bells ... King's Landing 6.697 12.759
250 8 5 The Bells ... King's Landing 6.697 12.759
251 8 5 The Bells ... King's Landing 6.697 12.759
252 8 6 The Iron Throne ... King's Landing 6.697 12.759
[253 rows x 11 columns]
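Because the general names were deduplicated earlier, each merge above should be many-to-one. pandas can enforce this assumption with `merge`'s `validate` argument, which raises a `MergeError` if the right-hand keys are not unique; a minimal sketch with toy frames (not the project data):

```python
import pandas as pd

# Toy stand-ins for the deduplicated death and character frames
deaths = pd.DataFrame({'name': ['will', 'gared'], 'season': [1, 1]})
characters = pd.DataFrame({'character': ['will', 'gared'],
                           'episodes_appeared': [1, 1]})

# validate='many_to_one' guards against the row fan-out a duplicated
# right-hand key would silently cause
merged = deaths.merge(characters, left_on='name', right_on='character',
                      how='left', validate='many_to_one')
print(len(merged) == len(deaths))
```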
Data Analysis and Visualization
Task1: How do the deaths of prominent characters influence the ratings?
To analyze the data, we sorted it by season, episode, and number of episodes appeared to identify the most prominent character who died in each episode. We then grouped the data by season and episode for aggregation. This allows us to calculate the death count for each episode while retaining the episode ratings for further analysis. Upon examining the aggregation results, we observed that not every episode has character deaths. To address this, we joined the aggregated data with the original episode data to include ratings for all episodes and filled the death count with zeros for episodes without any deaths. Finally, we created a scatter plot to visualize the relationship between episode ratings, death counts, and the deaths of prominent characters.
episode_summary <- deaths_detail_info |>
group_by( season, episode ) |>
arrange( desc(episodes_appeared), .by_group = TRUE ) |>
summarize(
num_deaths = n(),
ep_rating = first(rating),
prominent_death_episodes_appeared = first(episodes_appeared),
.groups = "drop"
) |>
right_join( got_episodes |>
select( season, episode, rating ),
by = c("season", "episode")
) |>
mutate(
num_deaths = replace_na(num_deaths, 0),
ep_rating = rating,
prominent_death_episodes_appeared = replace_na(prominent_death_episodes_appeared, 0)
) |>
select(season, episode, num_deaths, ep_rating, prominent_death_episodes_appeared) |>
arrange( season, episode )
episode_summary
# Add 1 so the logarithm is defined when the death count is 0
ggplot(episode_summary, aes(x = ep_rating, y = log(num_deaths + 1))) +
geom_point(aes(size = prominent_death_episodes_appeared), alpha = 0.8, color = "blue") +
scale_size_continuous(range = c(1, 8), name = "Appeared Episodes of\n Prominent Characters") +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(
title = "Relationship Between Episode Ratings, Death Count,\n and Prominent Character Deaths",
x = "Episode Rating",
y = "log(Death Count)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.title = element_text(size = 12)
)
`geom_smooth()` using formula = 'y ~ x'
Analysis: The scatter plot suggests a mild positive correlation between episode ratings and the number of deaths: episodes with higher ratings tend to have more deaths, especially as ratings approach 10. The regression line slopes slightly upward, but the correlation appears weak, as the points are widely scattered around the line. Episodes with both high and low death counts occur across the rating range, so higher-rated episodes may have slightly more deaths, but the pattern is neither strong nor consistent. Additionally, the deaths of prominent characters appear to be randomly distributed, making it difficult to identify any clear relationship between these deaths and episode ratings.
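The strength of this correlation can be quantified with a Pearson coefficient on the same transformed variables the plot uses. A sketch restricted to the season-1 rows shown in the summary table above (running it on the full episode_summary would give the series-wide figure):

```python
import numpy as np
import pandas as pd

# Season-1 rows from the episode summary table above
episode_summary = pd.DataFrame({
    'ep_rating':  [9.1, 8.8, 8.7, 8.8, 9.1, 9.2, 9.2, 9.0, 9.6, 9.5],
    'num_deaths': [7,   3,   0,   1,   5,   5,   1,   8,   4,   3],
})

# Pearson correlation between rating and the log(num_deaths + 1) transform used in the plot
r = episode_summary['ep_rating'].corr(np.log1p(episode_summary['num_deaths']))
print(round(r, 2))
```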
To analyze the data, we first organized it by season and episode. Next, we grouped the data by season and episode to calculate the total number of deaths in each episode, while identifying the most prominent character who died based on the number of episodes they appeared in. This allows us to calculate the death count for each episode and determine the prominence of deaths using the maximum episodes_appeared value. Upon examining the aggregation results, we observed that not every episode has character deaths. To address this, we joined the aggregated data with the original episode data to include ratings for all episodes and filled the death count with zeros for episodes without any deaths.
WITH
death_count AS (
SELECT
season,
episode,
COUNT(*) AS num_deaths,
MAX(rating) AS ep_rating,
MAX(episodes_appeared) AS prominent_death_episodes_appeared
FROM deaths_detail_info
GROUP BY season, episode
)
SELECT
got_episodes.season,
got_episodes.episode,
got_episodes.rating,
COALESCE(death_count.num_deaths, 0) AS num_deaths,
COALESCE(death_count.prominent_death_episodes_appeared, 0) AS prominent_death_episodes_appeared
FROM got_episodes
LEFT JOIN death_count
ON got_episodes.season = death_count.season
AND got_episodes.episode = death_count.episode
ORDER BY got_episodes.season, got_episodes.episode;

| season | episode | rating | num_deaths | prominent_death_episodes_appeared |
|---|---|---|---|---|
| 1 | 1 | 9.1 | 7 | 1 |
| 1 | 2 | 8.8 | 3 | 1 |
| 1 | 3 | 8.7 | 0 | 0 |
| 1 | 4 | 8.8 | 1 | 1 |
| 1 | 5 | 9.1 | 5 | 5 |
| 1 | 6 | 9.2 | 5 | 5 |
| 1 | 7 | 9.2 | 1 | 7 |
| 1 | 8 | 9.0 | 8 | 6 |
| 1 | 9 | 9.6 | 4 | 6 |
| 1 | 10 | 9.5 | 3 | 10 |
To analyze the data, we sorted it by season, episode, and episodes_appeared so that the most prominent character in each episode comes first. We then grouped the data by season and episode for aggregation, which yields the death count for each episode while retaining the episode ratings for further analysis. Finally, we created a scatter plot to visualize the relationship between episode ratings, death counts, and the deaths of prominent characters.
episode_summary = ( deaths_detail_info
.sort_values(by = ['season', 'episode', 'episodes_appeared'],
ascending=[True, True, False])
.groupby(['season', 'episode'])
.agg(
num_deaths = ('name', 'count'),
rating = ('rating', 'first' ),
episodes_appeared = ( 'episodes_appeared', 'first' )
)
.rename(columns = {'rating': 'ep_rating'})
.merge(
got_episodes[['season', 'episode', 'rating']],
on=['season', 'episode'],
how='right'
)
.fillna({
'num_deaths': 0,
'episodes_appeared': 0
})
.filter(items = ['season', 'episode', 'num_deaths', 'episodes_appeared', 'rating'])
)
episode_summary

    season  episode  num_deaths  episodes_appeared  rating
0 1 1 7.0 1.0 9.1
1 1 2 3.0 1.0 8.8
2 1 3 0.0 0.0 8.7
3 1 4 1.0 1.0 8.8
4 1 5 5.0 5.0 9.1
.. ... ... ... ... ...
68 8 2 1.0 3.0 7.9
69 8 3 8.0 52.0 7.5
70 8 4 2.0 38.0 5.5
71 8 5 11.0 62.0 6.0
72 8 6 1.0 62.0 4.1
[73 rows x 5 columns]
# Create figure
plt.figure(figsize=(12, 8))
# Scaling factor for point size
SIZE_SCALE = 10
# Create scatterplot with varying sizes
scatter = plt.scatter( x = episode_summary['rating'],
y = np.log1p(episode_summary['num_deaths']), # log1p avoids -inf for zero-death episodes
s = episode_summary['episodes_appeared'] * SIZE_SCALE,
alpha = 0.8,
color = 'blue')
# Calculate the regression line
z = np.polyfit(episode_summary['rating'], np.log1p(episode_summary['num_deaths']), 1)
p = np.poly1d(z)
# Add regression line
plt.plot(episode_summary['rating'],
p(episode_summary['rating']),
color = 'blue',
linestyle = '-')
plt.title('Relationship Between Episode Ratings, Death Count, and Prominent Character Deaths')
plt.xlabel('Episode Rating', fontsize = 12)
plt.ylabel('log(Death Count)', fontsize = 12)
# Add legends
handles, labels = scatter.legend_elements(prop = "sizes",
alpha = 0.6,
num = 5,
func = lambda x: x/SIZE_SCALE) # reverse the size scaling
plt.legend( handles,
labels,
title = "Appeared Episodes of\nProminent Characters")
# Show the plot
plt.show()

Analysis: This scatterplot suggests a weak positive relationship between episode ratings and the number of on-screen deaths. The regression line slopes slightly upward, so higher-rated episodes tend to have somewhat more deaths, particularly as ratings approach 10. The relationship is weak, however: the points are widely scattered around the line, and episodes with both high and low death counts appear across the full range of ratings. The deaths of prominent characters also appear to be randomly distributed, making it difficult to identify any clear relationship between those deaths and episode ratings.
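The visual impression of a weak correlation can be checked numerically. As a minimal sketch, the season 1 rows from the table above give the following Pearson correlation between rating and log-transformed death count (a rough illustration on ten episodes, not the full analysis):

```python
import numpy as np

# Season 1 ratings and death counts, taken from the query result above
ratings = np.array([9.1, 8.8, 8.7, 8.8, 9.1, 9.2, 9.2, 9.0, 9.6, 9.5])
deaths = np.array([7, 3, 0, 1, 5, 5, 1, 8, 4, 3])

# log1p keeps the zero-death episode finite, matching the regression fit
r = np.corrcoef(ratings, np.log1p(deaths))[0, 1]
print(f"Pearson r for season 1: {r:.2f}")
```

On these ten episodes the coefficient is modestly positive, consistent with the "weak positive" reading of the scatterplot.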
Task2: Which regions are targeted most often?
To identify the most frequently targeted regions, we grouped the death data by region and summarized it by counting the deaths in each region, keeping each region's map coordinates for visualization. To ensure the data set was complete, we removed any rows with missing coordinate values. The missing coordinates most likely came from the AI-generated look-up table, which sometimes returned regions outside the provided list (possibly a more accurate placement based on its own knowledge). We dropped these rows since they could not be plotted on the map.
death_location <- deaths_detail_info |>
group_by( region ) |>
summarize( death_count = n(),
fx = first(fx),
fy = first(fy)) |>
drop_na(fx, fy)
death_location

To better understand the results, we visualized the data points on the map using a scatterplot. The map serves as the background, with each point representing a victim’s location. The size of the dots allows viewers to easily identify patterns in victim numbers across different regions.
# Read map image
map_raw_img <- image_read( got_map_url ) |>
image_convert( format = "png" )
# Create a raster map with brightness adjustment
raster_map_img <- map_raw_img |>
image_modulate( brightness = 120) |>
as.raster()
# Calculate the image dimension
info <- image_info(map_raw_img)
IMAGE_DPI <- 96 # 96 dpi
width_in = info$width / IMAGE_DPI
height_in = info$height / IMAGE_DPI
# Plot the death count at their locations on the map
ggplot( death_location, aes( x = fx, y = height_in - fy ) ) +
annotation_raster(raster_map_img, xmin = 0, xmax = width_in, ymin = 0, ymax = height_in) +
geom_point( aes( size = death_count, color = death_count )) +
scale_color_gradient(low = "#FFB7B2",
high = "#8A3A33") +
scale_x_continuous(limits = c(0, width_in )) +
scale_y_continuous(limits = c( 0, height_in)) +
scale_size(range = c(2, 10)) +
labs(title = "Targeted Region Across Ice and Fire World", x = "", y = "") +
theme(legend.position = "none") +
coord_fixed(ratio = 1)

Analysis: This map highlights regions with the highest death counts across the Game of Thrones world, with the largest concentration of deaths occurring in the central area of Westeros, specifically around King’s Landing. This distribution suggests that major conflicts and battles are concentrated in and around key political centers.
To identify the most frequently targeted regions, we grouped the death data by region and summarized it by counting the deaths in each region, keeping each region's map coordinates for further analysis. To ensure the data set was complete, we removed any rows with missing coordinate values. The missing coordinates most likely came from the AI-generated look-up table, which sometimes returned regions outside the provided list (possibly a more accurate placement based on its own knowledge). We dropped these rows since they could not be plotted on the map.
SELECT
normalized_region AS region,
COUNT(*) AS death_count,
MAX(loc_x) AS fx,
MAX(loc_y) AS fy
FROM deaths_detail_info
WHERE loc_x IS NOT NULL AND loc_y IS NOT NULL
GROUP BY region;

| region | death_count | fx | fy |
|---|---|---|---|
| riverrun | 4 | 5.051 | 11.061000 |
| pyke | 1 | 3.229 | 10.572000 |
| last hearth | 1 | 7.025 | 4.911000 |
| vaes dothrak | 3 | 23.726 | 11.686000 |
| dragonstone | 6 | 8.311 | 11.967000 |
| astapor | 4 | 19.196 | 17.341999 |
| moat cailin | 1 | 5.947 | 8.145001 |
| braavos | 2 | 10.342 | 9.363000 |
| winterfell | 31 | 5.666 | 6.218000 |
| king’s landing | 53 | 6.697 | 12.759000 |
To identify the most frequently targeted regions, we grouped the death data by region and aggregated it by counting the deaths in each region, keeping each region's map coordinates for visualization. To ensure the data set was complete, we removed any rows with missing coordinate values. The missing coordinates most likely came from the AI-generated look-up table, which sometimes returned regions outside the provided list (possibly a more accurate placement based on its own knowledge). We dropped these rows since they could not be plotted on the map.
death_location = ( deaths_detail_info
.groupby('region')
.agg(
death_count=('region', 'count'),
loc_x=('fx', 'first'),
loc_y=('fy', 'first') )
.dropna(subset=['loc_x', 'loc_y'])
)
death_location

                 death_count   loc_x   loc_y
region
Astapor 4 19.196 17.342
Braavos 2 10.342 9.363
Dorne 9 8.416 16.529
Dragonstone 6 8.311 11.967
Dreadfort 3 7.280 5.999
Harrenhal 3 6.218 11.519
King's Landing 54 6.697 12.759
Last Hearth 1 7.025 4.911
Meereen 18 19.935 15.592
Moat Cailin 1 5.947 8.145
Pentos 1 10.395 12.728
Pyke 1 3.229 10.572
Qarth 8 25.580 18.685
Riverrun 4 5.051 11.061
The Eyrie 2 7.030 10.426
The Twins 10 5.260 9.811
The Wall 40 6.770 3.822
Vaes Dothrak 3 23.726 11.686
Winterfell 32 5.666 6.218
Yunkai 3 19.550 16.102
To better understand the results, we visualized the data points on the map using a scatterplot. The map serves as the background, with each point representing a victim’s location. The size of the dots allows viewers to easily identify patterns in victim numbers across different regions.
# Read map image
response = requests.get(got_map_url[0])
map_raw_img = Image.open(BytesIO(response.content))
map_raw_img

<PIL.PngImagePlugin.PngImageFile image mode=RGB size=3600x2400 at 0x7F1CDE197A90>
# Create a raster map with brightness adjustment
enhancer = ImageEnhance.Brightness(map_raw_img)
raster_map_img = enhancer.enhance(1.2)
# Calculate image dimensions
IMAGE_DPI = 96
width_in = map_raw_img.size[0] / IMAGE_DPI
height_in = map_raw_img.size[1] / IMAGE_DPI
# Create the plot
plt.figure(figsize=(width_in, height_in), dpi=IMAGE_DPI)
# Add the map as background
plt.imshow(raster_map_img, extent=[0, width_in, 0, height_in])
# Plot the points
scatter = plt.scatter(
death_location['loc_x'],
height_in - death_location['loc_y'], # flip Y axis to align coordinate
s = death_location['death_count'] * 100, # adjust point size
c = death_location['death_count'],
alpha = 0.9
)

Analysis: This map highlights regions with the highest death counts across the Game of Thrones world, with the largest concentration of deaths occurring in the central area of Westeros, specifically around King’s Landing. This distribution suggests that major conflicts and battles are concentrated in and around key political centers.
Challenges
Handling Many-to-Many Mappings
While processing the data, we encountered issues when trying to merge the death and character datasets, leading to duplicate names and causing a many-to-many mapping warning. This issue occurred because the original character names included repeated entries, specifically due to the presence of non-individual character names, such as Wildling, which represented a collective group rather than a specific character. After consulting with Dr. Howison, we decided to remove the duplicated names from the dataset, which resolved the many-to-many conflict and allowed for a cleaner join between the datasets.
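The same problem can be caught mechanically: pandas' `merge` accepts a `validate` argument that raises an error whenever a join is not the expected shape. A minimal sketch with hypothetical rows (the repeated "Wildling" stands in for the collective names in the real data):

```python
import pandas as pd

# Hypothetical rows: "Wildling" is a collective name repeated in the lookup
deaths = pd.DataFrame({"name": ["Ned Stark", "Wildling", "Wildling"]})
characters = pd.DataFrame({"name": ["Ned Stark", "Wildling", "Wildling"],
                           "house": ["Stark", None, None]})

# validate="many_to_one" raises MergeError because the character keys repeat
try:
    deaths.merge(characters, on="name", validate="many_to_one")
except pd.errors.MergeError:
    print("duplicate character names detected")

# Dropping the duplicated collective names allows a clean join
characters_unique = characters.drop_duplicates(subset="name")
merged = deaths.merge(characters_unique, on="name", how="left",
                      validate="many_to_one")
```

Using `validate` up front turns a silent row-multiplying join into an explicit error, which is effectively what the warning we saw was signalling.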
Correcting Coordinate Alignment on the Map
During the creation of a scatterplot showing death count points on the map, we noticed that some points fell in the sea rather than on land. To diagnose the issue, we examined the map location data to understand how the coordinates were defined and discovered that they followed a standard 2D screen coordinate system, with the origin at the top-left corner. The map’s notes also specified its scale: 96 DPI. With this information, we transformed our coordinates to align with the data’s coordinate system, which allowed us to plot the points correctly on the map.
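The transformation itself is small. A sketch of the y-axis flip, using the 96-DPI scale from the map's notes and the 3600x2400 pixel size reported by PIL above:

```python
# Constants taken from the map's notes and the PIL output in this report
IMAGE_DPI = 96
IMG_WIDTH_PX, IMG_HEIGHT_PX = 3600, 2400

height_in = IMG_HEIGHT_PX / IMAGE_DPI  # map height in inches

def flip_y(y_in):
    """Map a top-left-origin y coordinate (in inches) to matplotlib's
    bottom-left origin by subtracting it from the map height."""
    return height_in - y_in
```

For example, King’s Landing at fy = 12.759 plots at flip_y(12.759) on the matplotlib axes, which is the same `height_in - loc_y` expression used in the plotting code below.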
Bridging the Gap Between Data sets
We encountered a challenge in joining the death data and map location data due to a lack of sufficient information. After searching online for a while, we realized that acquiring this data might not be possible. We then considered leveraging the power of AI to bridge the gap. While using AI introduced some uncertainty in the information, we manually examined the results. After several checks, we determined that the AI-generated data could serve as a useful source to help connect the data sets. Given that there were 41 rows involved, this approach reduced the effort needed to assign and search for data manually.
Loading JSON data into DuckDB
Another challenge we faced was loading JSON data into DuckDB while working on the SQL part. Initially, we encountered errors when trying to use dbWriteTable() with data loaded using fromJSON(), and we found that this was due to the presence of nested structures. The fromJSON() function often returns a list with hierarchical data, while dbWriteTable() requires a flat data frame, which DuckDB cannot handle directly if the input contains nested lists. We solved this by flattening the JSON data into a data frame, ensuring all nested elements were converted into a tabular structure. Once flattened, the data was compatible with dbWriteTable(), allowing us to write it to DuckDB successfully.

From Dr. Howison’s feedback on using native DuckDB reading methods, we then revised our approach. Instead of relying on external R functions like fromJSON() and dbWriteTable(), we used DuckDB’s native read_json_auto function to load the JSON file directly into a DuckDB table. This method handled the hierarchical JSON structures automatically, eliminating the need for manual flattening. We then used the UNNEST function to process the nested locations field, creating one row for each object in the array, and extracted specific fields with the ->> operator to build a usable table named locations.

Data Consistency and Matching During Joins
While also working on the SQL part, we found the region column empty after a join, indicating that the location values did not match between tables. SQL joins are typically case-sensitive, so capitalization differences and extra spaces in the location columns caused the join to fail. We solved this by standardizing the values before joining: we applied LOWER() to convert all text to lowercase and TRIM() to remove extra spaces. This ensured that values from both tables matched regardless of case or formatting inconsistencies, allowing a successful join that populated the region column correctly.
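The same normalization works on the pandas side of the project. A minimal sketch with hypothetical location strings that differ only in case and trailing whitespace, mirroring the SQL LOWER()/TRIM() fix:

```python
import pandas as pd

# Hypothetical values that differ only in case and trailing whitespace
deaths = pd.DataFrame({"location": ["King's Landing ", "WINTERFELL"]})
coords = pd.DataFrame({"location": ["king's landing", "winterfell"],
                       "fx": [6.697, 5.666]})

# Lowercase and strip the join key before merging, like LOWER() and TRIM()
deaths["key"] = deaths["location"].str.lower().str.strip()
merged = deaths.merge(coords, left_on="key", right_on="location",
                      how="left", suffixes=("", "_coord"))
```

Without the normalized key, both rows would come back with missing coordinates, which is exactly the empty-region symptom we saw in SQL.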
Takeaways
Comparison of different methods
From learning about three different methods (R, SQL, Python) in class to process and visualize data, we found it helpful to use all three in this project to gain a better understanding of their strengths and capabilities. We found that R was particularly effective for data visualization and statistical analysis. The use of libraries such as dplyr for data manipulation and ggplot2 for creating detailed visualizations made R a useful tool for exploratory data analysis and deriving insights. We found that SQL was highly efficient for data processing, including data extraction, aggregation, and joining. Its ability to handle large datasets directly within the database environment made it important for the initial stages of data preparation and summary. Python struck a balance between SQL and R, offering both powerful data manipulation and easy integration with other Python libraries, such as pandas for data wrangling and matplotlib for visualizations. We found this flexibility made Python particularly effective for combining data transformation tasks with in-depth analysis and visualization. By utilizing SQL for data preparation, R for visualization, and Python for flexible data processing, we were able to leverage each tool’s strengths to create a more efficient and comprehensive data analysis workflow.
Planning Before Starting the Project
Another takeaway was the importance of planning before starting the project. Initially, we underestimated the time and effort needed to clean and transform the data. We faced issues due to insufficient planning for data preparation steps, specifically not combining the datasets into one cohesive dataset when working on the R portion. Instead, we pulled data from individual tables for our analysis, which led to inefficiencies in our workflow. This approach eventually required us to revisit and redo parts of the R analysis, combining all datasets into one while simultaneously working on the SQL and Python portions of the project. This experience taught us the value of thorough planning, especially for data integration, to ensure all relevant datasets are properly prepared from the beginning. Moving forward, we learned that spending more time on the initial planning and data preparation phase will allow for smoother transitions between different analysis tools and more effective in-depth analysis.
Importance of Carrying Out Validation Queries
After our meeting with Dr. Howison, we realized the importance of carrying out validation queries in the project, especially when using multiple methods (R, SQL, and Python) to process and analyze data, as well as when combining datasets to ensure data integrity. While merging tables and cleaning data, running validation checks helped catch inconsistencies early. For example, it was important to check row counts to ensure no unintended data loss occurred, verify data types to maintain consistency across fields, and handle missing values appropriately to prevent gaps in our analysis. Additionally, checking for duplicates ensured that there were no unintended duplicate records, which could impact the accuracy of our results. These validation steps ensured that key metrics, such as death counts or character details, were accurately maintained across datasets. We learned that performing these checks was crucial in ensuring that our subsequent analyses were built on a solid, reliable foundation.
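In pandas, these checks reduce to a handful of assertions run immediately after a merge. A minimal sketch with hypothetical frames (the column names echo the episode and death tables used throughout this report):

```python
import pandas as pd

# Hypothetical episode and death tables, shaped like the project's data
episodes = pd.DataFrame({"season": [1, 1, 1], "episode": [1, 2, 3],
                         "rating": [9.1, 8.8, 8.7]})
deaths = pd.DataFrame({"season": [1, 1], "episode": [1, 2],
                       "num_deaths": [7, 3]})

merged = episodes.merge(deaths, on=["season", "episode"], how="left")
merged["num_deaths"] = merged["num_deaths"].fillna(0)

# Row count: a left join must not drop or duplicate episodes
assert len(merged) == len(episodes)
# Duplicates: each (season, episode) pair should appear exactly once
assert not merged.duplicated(subset=["season", "episode"]).any()
# Missing values: ratings and death counts must be fully populated
assert merged[["rating", "num_deaths"]].notna().all().all()
```

Running assertions like these after every join would have caught the empty-region and many-to-many problems described above as soon as they occurred.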